adding relations #158

kyleclo · 2022-10-11T17:28:08Z

This PR substantially extends the library's functionality by adding a new Annotation type called Relation. A Relation is a link between two Annotations (e.g. a Citation linked to its Bib Entry). The two linked Annotations are called key and value.

A few things needed to change to support Relations:

Annotation Names

Relations store references to Annotation objects, but we didn't want Relation.to_json() to also call .to_json() on those objects. We only want to store minimal identifiers of the key and value: something short like bib_entry-5 or sentence-13. We call these short strings names.

To do this, we added an optional attribute to the Annotation class, field: str, which stores the name of the field the Annotation was added under. It's automatically populated when you run Document.annotate(new_field = list_of_annotations); each of those input annotations will have the new field name stored under .field.

We also added name, which returns an AnnotationName for a particular Annotation object that is unique at the document level. AnnotationName is a minimal class that basically stores .field and .id.

In short, now after you annotate a Document with annotations, you can do stuff like:

assert doc.tokens[15].name == AnnotationName(field='tokens', id=15)
assert str(doc.tokens[15].name) == 'tokens-15'
assert AnnotationName.from_str('tokens-15') == AnnotationName(field='tokens', id=15)
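For readers who want to see the round trip in isolation, here is a minimal sketch of a name class with the behavior described above. It is a stand-in written for illustration, not the code in this PR: the frozen dataclass and the `id_` local are my assumptions, and the `<field>-<id>` layout is taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationName:
    """Illustrative stand-in: a (field, id) pair that prints as 'field-id'."""

    field: str
    id: int

    def __str__(self) -> str:
        return f"{self.field}-{self.id}"

    @classmethod
    def from_str(cls, s: str) -> "AnnotationName":
        # Assumes exactly one '-' separating the field name from the integer id.
        field, id_ = s.split("-")
        return cls(field=field, id=int(id_))


name = AnnotationName(field="tokens", id=15)
assert str(name) == "tokens-15"
assert AnnotationName.from_str("tokens-15") == name
```

Note that a plain `split("-")` raises a ValueError if a field name ever contains a hyphen, which is one reason to keep the parsing in a single place.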

Lookups based on names

To support reconstructing a Relation object given the names of its key and value, we need the ability to look up the Annotations involved. We introduce a new method to enable this:

annotation_name = AnnotationName.from_str('paragraphs-99')
a = document.locate_annotation(annotation_name)  # returns the specific Annotation object
assert a.id == 99
assert a.field == 'paragraphs'
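A rough sketch of how such a lookup could work, using toy classes rather than the real Document and Annotation. Everything prefixed with `Toy` is hypothetical, and so is the assumption that an annotation's id is its position within its field.

```python
from typing import Dict, List, NamedTuple, Optional


class ToyName(NamedTuple):
    """Hypothetical stand-in for AnnotationName."""

    field: str
    id: int


class ToyAnnotation:
    def __init__(self) -> None:
        self.field: Optional[str] = None
        self.id: Optional[int] = None


class ToyDocument:
    def __init__(self) -> None:
        self._fields: Dict[str, List[ToyAnnotation]] = {}

    def annotate(self, **fields: List[ToyAnnotation]) -> None:
        # Stamp each annotation with its field name and (assumed) positional id.
        for field_name, annotations in fields.items():
            for i, annotation in enumerate(annotations):
                annotation.field = field_name
                annotation.id = i
            self._fields[field_name] = annotations

    def locate_annotation(self, name: ToyName) -> ToyAnnotation:
        # The field selects the list, the id selects the element.
        return self._fields[name.field][name.id]


doc = ToyDocument()
doc.annotate(paragraphs=[ToyAnnotation() for _ in range(100)])
found = doc.locate_annotation(ToyName(field="paragraphs", id=99))
assert found.id == 99
assert found.field == "paragraphs"
```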

to and from JSON

Finally, we need some way to serialize a Relation to JSON and reconstruct it from JSON. For serialization, now that we have Names, the JSON is quite minimal:

{'key': <name_of_key>, 'value': <name_of_value>, ...other stuff that all Annotation objects have,  like Metadata...}

Reconstructing a Relation from JSON is trickier because the JSON is meaningless without a Document object. The Document must also already hold the referenced Annotations so that the lookup based on these Names can succeed.

The API for this is similar, but you must also pass in the Document object:

relation = Relation.from_json(my_relation_dict, my_document_containing_necessary_fields)
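To make the asymmetry concrete, here is a small self-contained sketch: to_json only needs the two names, while from_json needs a document to resolve them. All `Toy` classes are hypothetical stand-ins, and names are plain strings here for brevity.

```python
from typing import Dict, List


class ToyAnnotation:
    def __init__(self, field: str, id: int) -> None:
        self.field = field
        self.id = id

    @property
    def name(self) -> str:
        return f"{self.field}-{self.id}"


class ToyDocument:
    def __init__(self, annotations: List[ToyAnnotation]) -> None:
        self._by_name: Dict[str, ToyAnnotation] = {a.name: a for a in annotations}

    def locate_annotation(self, name: str) -> ToyAnnotation:
        return self._by_name[name]


class ToyRelation:
    def __init__(self, key: ToyAnnotation, value: ToyAnnotation) -> None:
        self.key = key
        self.value = value

    def to_json(self) -> Dict[str, str]:
        # Store only the names, never the serialized annotations themselves.
        return {"key": self.key.name, "value": self.value.name}

    @classmethod
    def from_json(cls, relation_dict: Dict[str, str], doc: ToyDocument) -> "ToyRelation":
        # The names are only meaningful relative to a document that holds them.
        return cls(
            key=doc.locate_annotation(relation_dict["key"]),
            value=doc.locate_annotation(relation_dict["value"]),
        )


citation = ToyAnnotation("citations", 3)
bib_entry = ToyAnnotation("bib_entries", 5)
doc = ToyDocument([citation, bib_entry])

as_json = ToyRelation(key=citation, value=bib_entry).to_json()
restored = ToyRelation.from_json(as_json, doc)
assert as_json == {"key": "citations-3", "value": "bib_entries-5"}
assert restored.key is citation and restored.value is bib_entry
```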


soldni

The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?

soldni · 2022-10-15T01:00:33Z

mmda/types/annotation.py

+        # TODO[kylel] - when does this ever get called? infinite loop?
        return self.__getattribute__(field)


This SO answer has a good explainer. I don't think you want to keep this here: __getattribute__ is called before __getattr__.

soldni · 2022-10-15T01:03:14Z

mmda/types/annotation.py

@@ -142,47 +178,38 @@ def __deepcopy__(self, memo):

    @property
    def type(self) -> str:
+        logging.warning(msg='`.type` to be deprecated in future versions. Use `.metadata.type`')


nit: would not use the root logger. I would add

Logger = logging.getLogger(__name__)

after imports, and then call Logger.warning instead.
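Roughly what that would look like, as a sketch. `ToyAnnotation` and its `_type` attribute are made up for the example; only the warning text is taken from the diff above.

```python
import logging

# Module-level logger, named after the module, instead of the root logger.
Logger = logging.getLogger(__name__)


class ToyAnnotation:
    def __init__(self, type_: str) -> None:
        self._type = type_

    @property
    def type(self) -> str:
        Logger.warning("`.type` to be deprecated in future versions. Use `.metadata.type`")
        return self._type


assert ToyAnnotation("paragraph").type == "paragraph"
```

With a named logger, users of the library can silence or redirect these warnings per module without touching the root logger's configuration.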

soldni · 2022-10-16T23:12:02Z

mmda/types/annotation.py

+        # TODO[kylel] - when does this ever get called? infinite loop?
        return self.__getattribute__(field)


You probably can remove it; this stack overflow post has a good explanation, but tl;dr is that __getattribute__ supersedes __getattr__ if defined, so you will never be in a situation where __getattribute__ should be called after __getattr__.

I suspect that the function of this call here is to raise an error if field is not defined; if that is the case, I would explicitly raise an AttributeError instead with a more descriptive error message.

soldni · 2022-10-16T23:12:53Z

mmda/types/annotation.py

            boxes: List[Box],
            id: Optional[int] = None,
            doc: Optional['Document'] = None,
+            field: Optional[str] = None,
            metadata: Optional[Metadata] = None,


Not related to this PR specifically, but I think we should document these arguments in a docstring.

yea we'll need that

soldni · 2022-10-16T23:20:07Z

mmda/types/document.py


        return doc
+
+    def locate_annotation(self, name: AnnotationName) -> Annotation:


I am not quite sure why we need AnnotationName. If a name is just a combination of field name + id, can we just expect a tuple of the two here, or require two args? Overall, having a class just to hold annotation names would add computational overhead.

see comment below

kyleclo · 2022-10-21T18:20:36Z

@soldni

The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?

Without the class, we would need to encode somewhere else how IDs are constructed in the library. For now, it's <field_name>-<integer_id>, but it's possible this will need to be extended in the future.

We also need some way to parse this ID so we can look up that specific element within a Document. I don't want field, id = obj.split('-') scattered throughout the code, since that gets hard to maintain if we ever change something. The class gives us .field and .id attributes to use here.

soldni · 2022-10-21T21:44:11Z

mmda/types/annotation.py

+    def from_str(cls, s: str) -> 'AnnotationName':
+        field, id = s.split('-')
+        id = int(id)
+        return AnnotationName(field=field, id=id)


Suggested change

return AnnotationName(field=field, id=id)

return cls(field=field, id=id)

Prevents issues when inheriting (if we ever decide to).
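A small sketch of the difference, with a hypothetical subclass. `Name`, `VersionedName`, and `from_str_hardcoded` are names invented for this example.

```python
class Name:
    def __init__(self, field: str, id: int) -> None:
        self.field = field
        self.id = id

    @classmethod
    def from_str_hardcoded(cls, s: str) -> "Name":
        field, id_ = s.split("-")
        # Always builds the base class, even when called on a subclass.
        return Name(field=field, id=int(id_))

    @classmethod
    def from_str(cls, s: str) -> "Name":
        field, id_ = s.split("-")
        # Builds whichever class the method was called on.
        return cls(field=field, id=int(id_))


class VersionedName(Name):
    """Hypothetical subclass, only here to show the difference."""


assert type(VersionedName.from_str_hardcoded("tokens-15")) is Name
assert type(VersionedName.from_str("tokens-15")) is VersionedName
```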

soldni · 2022-10-21T21:47:16Z

mmda/types/annotation.py

@@ -34,6 +35,22 @@ def warn_deepcopy_of_annotation(obj: "Annotation") -> None:
    warnings.warn(msg, UserWarning, stacklevel=2)


+class AnnotationName:
+    """Stores a name that uniquely identifies this Annotation within a Document"""
+


Suggested change

__slots__ = ("field", "id")

Speeds up instance creation by roughly 20%:

>>> %timeit AnnotationName('relation', 1)
140 ns ± 0.391 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

# after adding slots
>>> %timeit AnnotationName('relation', 1)
115 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
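For anyone who wants to reproduce this outside IPython, here is a sketch using the standard timeit module. `Plain`, `Slotted`, and `time_creation` are stand-ins I made up; the absolute numbers will vary by machine and Python version, so none are shown.

```python
import timeit


class Plain:
    def __init__(self, field: str, id: int) -> None:
        self.field = field
        self.id = id


class Slotted:
    # No per-instance __dict__; attributes live in fixed slots.
    __slots__ = ("field", "id")

    def __init__(self, field: str, id: int) -> None:
        self.field = field
        self.id = id


def time_creation(cls, number: int = 200_000) -> float:
    """Total seconds spent creating `number` instances of `cls`."""
    return timeit.timeit(lambda: cls("relation", 1), number=number)


assert hasattr(Plain("relation", 1), "__dict__")
assert not hasattr(Slotted("relation", 1), "__dict__")

if __name__ == "__main__":
    for cls in (Plain, Slotted):
        print(f"{cls.__name__}: {time_creation(cls):.4f} s")
```

One side effect worth knowing: with `__slots__`, assigning an attribute that is not listed raises AttributeError, so the class can no longer be extended ad hoc.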

soldni · 2022-10-21T21:47:57Z


@kyleclo Sounds good! added two small suggestions to improve it, but otherwise ok to merge!

kyleclo added 19 commits October 5, 2022 21:53

simplify annotation

d9a64ac

remove unused imports

b17bbb7

remove metadata anno in favor of getter/setters

3132290

remove spangroup nesting

055dfea

fix grobid parser tests

813e134

fix tests for dict word predictor

affc3e8

fix bug in grobidparser

87de5c9

fix bug; forgot to remove type from init in boxgroup

92b79d4

make tests in json conversion more lenient; doesnt need to be exactly…

70bc6bf

… the same object, just the values need to be the same after to/from json

oops forgot to commit

34d6fc6

change internal ai2 test api

32a2e11

bugfix; vila model SpanGroup creation had type

d763d09

modify metadata constructor to take args; adjust all tests/models tha…

98da4ea

…t instantiate spangroups

add basic relation json conversion; WIP

9d09af7

WIP; for relations, handle field-aware id

e827f02

merge

8490804

wip; minor cleanup

2577c7d

add AnnotationName class; add lookup method to Document based on name…

e5d1d28

…; base Relation class on storage of these names; define to and from_json

add relation test; remove ability to create relation from json

f83eb00

kyleclo requested review from soldni, cmwilhelm and rauthur and removed request for soldni and rauthur October 14, 2022 01:07

kyleclo added 2 commits October 13, 2022 18:08

remove unused ways to from JSON for relations

84682ee

Merge branch 'main' into kylel/2022-10/relation

b93462c

soldni reviewed Oct 16, 2022


merge conflicts

24f0c39

kyleclo changed the title ~~[wip] adding relations~~ adding relations Oct 21, 2022

kyleclo requested review from egork520 and geli-gel October 21, 2022 06:37

kyleclo added 2 commits October 20, 2022 23:39

replace getattribute with getattr

818349b

return empty list if no getattr match

5820934

soldni approved these changes Oct 21, 2022

